How can deep learning be combined with theoretical linguistics?

Natural language processing is mostly done using deep learning and neural networks nowadays. In a typical NLP paper, you might see some Transformer models, some RNNs built using linear algebra and statistics, but very little linguistic theory. Is linguistics irrelevant to NLP now, or can the two fields still contribute to each other?

In a series of articles in the Language journal, Joe Pater discussed the history of neural networks and generative linguistics, and invited experts to give their perspectives of how the two may be combined going forward. I found their discussion very interesting, although a bit long (almost 100 pages). In this blog post, I will give a brief summary of it.

Generative Linguistics and Neural Networks at 60: Foundation, Friction, and Fusion

Research in generative syntax and neural networks began at the same time in 1957, and were both broadly considered under AI, but the two schools mostly stayed separate, at least for a few decades. In neural network research, Rosenblatt proposed the perceptron learning algorithm and realized that you needed hidden layers to learn XOR, but didn’t know of a procedure to train multi-layer NNs (backpropagation wasn’t invented yet). In generative grammar, Chomsky studied natural language like formal languages, and proposed controversial transformational rules. Interestingly, both schools faced challenges from learnability of their systems.

Above: Frank Rosenblatt and Noam Chomsky, two pioneers of neural networks and generative grammar, respectively.

The first time these two schools were combined was in 1986, when a RNN was used to learn a probabilistic model of past tense. This shows that neural networks and generative grammar are not incompatible, and the dichotomy is a false one. Another method of combining them comes from Harmonic Optimality Theory in theoretical phonology, which extends OT to continuous constraints and the procedure for learning them is similar to gradient descent.

Neural models have proved to be capable of learning a remarkable amount of syntax, despite having a lot less structural priors than Chomsky’s model of Universal Grammar. At the same time, they fail with certain complex examples, so maybe it’s time to add back some linguistic structure.

Linzen’s Response

Linguistics and DL can be combined in two ways. First, linguistics is useful for constructing minimal pairs for evaluating neural models, when such examples are hard to find in natural corpora. Second, neural models can be quickly trained on data, so they’re useful for testing learnability. By comparing human language acquisition data with various neural architectures, we can gain insights about how human language acquisition works. (But I’m not sure how such a deduction would logically work.)

Potts’s Response

Formal semantics has not had much contact with DL, as formal semantics is based around higher-order logic, while deep learning is based on matrices of numbers. Socher did some work of representing tree-based semantic composition as operations on vectors.

Above: Formal semantics uses higher-order logic to build representations of meaning. Is this compatible with deep learning?

In several ways, semanticists make different assumptions from deep learning. Semantics likes to distinguish meaning from use, and consider compositional meaning separately from pragmatics and context, whereas DL cares most of all about generalization, and has no reason to discard context or separate semantics and pragmatics. Compositional semantics does not try to analyze meaning of lexical items, leaving them as atoms; DL has word vectors, but linguists criticize that individual dimensions of word vectors are not easily interpretable.

Rawski and Heinz’s Response

Above: Natural languages exhibit features that span various levels of the Chomsky hierarchy.

The “no free lunch” theorem in machine learning says that you can’t get better performance for free, any gains in some problems must be compensated by decreases in performance on other problems. A model performs well if it has an inductive bias well-suited for the type of problems it applies to. This is true for neural networks as well, and we need to study the inductive biases in neural networks: which classes of languages in the Chomsky hierarchy are NNs capable of learning? We must not confuse ignorance of bias with absence of bias.

Berent and Marcus’s Response

There are significant differences between how generative syntax and neural networks view language, that must be resolved before the fields can make progress with integration. The biggest difference is the “algebraic hypothesis” — the assumption that there exists abstract algebraic categories like Noun, that’s distinct from their instances. This allows you to make powerful generalizations using rules that operate on abstract categories. On the other hand, neural models try to process language without structural representations, and this results in failures in generalizations.

Dunbar’s Response

The central problem in connecting neural networks to generative grammar is the implementational mapping problem: how do you decide if a neural network N is implementing a linguistic theory T? The physical system might not look anything like the abstract theory, eg: implementing addition can look like squiggles on a piece of paper. Some limited classes of NNs may be mapped to harmonic grammar, but most NNs cannot, and the success criterion is unclear right now. Future work should study this problem.

Pearl’s Response

Neural networks learn language but don’t really try to model human neural processes. This could be an advantage, as neural models might find generalizations and building blocks that a human would never have thought of, and new tools in interpretability can help us discover these building blocks contained within the model.

Leave a comment